The analysis is based on descriptive datasets published on the two cities' government websites. The main objective is to find visible trends and patterns in crimes committed and reported across city regions.
First, we will look at both cities' datasets and do some data cleansing. The data will then be normalized and joined for easier analysis. Since this is descriptive data, binning and counts will be used for clarity. From there, the analysis can proceed.
Second, we will create some graphs for clearer storytelling.
Third, we will use geospatial analysis procedures to sharpen the descriptive analysis.
Lastly, we will compare the similarity between the two cities to look for statistically significant differences.
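The clean → normalize → join → count pipeline just described can be sketched on toy data. This is a minimal illustration, not the real datasets; only the column names `X` and `Offense Type` mirror the two real schemas, and the values are made up.

```python
import pandas as pd

# Two toy city datasets with differing schemas (values are illustrative)
a = pd.DataFrame({"Category": ["THEFT", "ASSAULT"], "X": [-122.4, -122.5]})
b = pd.DataFrame({"Offense Type": ["THEFT-OTH"], "Longitude": [-122.3]})

# Normalize: rename columns to a shared schema, map category names
a = a.rename(columns={"X": "Longitude"})
b = b.rename(columns={"Offense Type": "Category"})
b["Category"] = b["Category"].replace({"THEFT-OTH": "THEFT"})

# Join: union the rows; then bin/count per category
both = pd.concat([a, b], ignore_index=True)
counts = both["Category"].value_counts()
```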
# Packages needed for the analysis
%matplotlib inline
import numpy as np
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import sys
from bokeh.plotting import *
from bokeh.models import HoverTool
from collections import OrderedDict
YEAR = 2014
PERIOD = "summer"
TYPE = "incidents"
DATA_DIR = "data"
SF = "sanfrancisco"
SEATTLE = "seattle"
# Read in our data. We've aggregated it by date already, so we don't need to worry about paging
sf = pd.read_csv('sanfrancisco_incidents_summer_2014.csv')
sf_nrows = sf.shape[0]
sf.head()
# This is for time series reasons
date = pd.to_datetime(sf['Date'])
print(date.min())
print(date.max())
t_delta = (date - date.min()).dt.days
sf['days'] = t_delta
sf.head(1)
# Now for the Seattle dataset
seattle = pd.read_csv('seattle_incidents_summer_2014.csv')
seattle_nrows = seattle.shape[0]
seattle.head()
# This is also for time series reasons
date = pd.to_datetime(seattle['Date Reported'])
print(date.min())
print(date.max())
t_delta = (date - date.min()).dt.days
seattle['days'] = t_delta
seattle.head(1)
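The day-offset computation used for both cities can be checked on a toy series; `.dt.days` gives the elapsed days as plain integers.

```python
import pandas as pd

# Toy dates standing in for the parsed 'Date' column
date = pd.to_datetime(pd.Series(["2014-06-01", "2014-06-03", "2014-08-31"]))
# Integer days elapsed since the earliest report
days = (date - date.min()).dt.days
```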
# Grouping of categories
sf.groupby('Category').groups.keys()
# Renaming some features for cross-analysis between the two datasets
sf.rename(columns = {"X": "Longitude", "Y": "Latitude", "PdDistrict": "District", "Descript": "Description"},
inplace=True)
# Splitting and creating new features
sf["City"] = pd.Series(['sanfrancisco']*sf.shape[0], index = sf.index)
sf["DayOfMonth"] = sf.Date.map(lambda x: int(x.split("/")[1]))
sf["Month"] = sf.Date.map(lambda x: int(x.split("/")[0]))
sf["Year"] = sf.Date.map(lambda x: int(x.split("/")[2]))
# Takes a Time string as "HH:MM" and returns the hour as an int
def time_bin(t):
    hour, _ = t.split(":")
    return int(hour)
sf["TimeBin"] = sf.Time.map(time_bin)
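Since `time_bin` bins to whole hours, a hypothetical finer-grained variant (not used in this analysis) would fold the minutes in as well:

```python
# Hypothetical generalization of time_bin: bin an "HH:MM" string into
# fixed-width minute bins instead of whole hours.
def time_bin_minutes(t, bin_size=30):
    hour, mins = t.split(":")
    # Index of the bin_size-minute bin since midnight
    return (int(hour) * 60 + int(mins)) // bin_size
```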
# Drop features that are obsolete
sf.drop(["IncidntNum", "PdId", "Location", "Resolution"], axis=1, inplace=True, errors="ignore")
sf.head(1)
# Reindex the DataFrame
columns_index = ["City", "DayOfMonth", "DayOfWeek", "Month", "Year", "Time", "TimeBin",
"Longitude", "Latitude", "Address", "District",
"Category", "Description"
]
sf_normalized = sf[columns_index]
sf_normalized.head(1)
# Let's drop some columns...
seattle_drop_columns = ["RMS CDW ID",
"General Offense Number",
"Offense Code",
"Offense Code Extension",
"Summary Offense Code",
"Census Tract 2000",
"Location"
]
seattle.drop(seattle_drop_columns, axis=1, inplace=True, errors="ignore")
# and rename some
seattle.rename(columns = {"Hundred Block Location": "Address", "Offense Type": "Category",
"Summarized Offense Description": "Description",
"District/Sector": "District"},
inplace=True)
# Here we normalize seattle's dataset into sf's (City, DayOfMonth, DayOfWeek, Time, TimeBin)
seattle["City"] = pd.Series(['seattle'] * seattle.shape[0], index=seattle.index)
seattle["DayOfMonth"] = seattle["Occurred Date or Date Range Start"].map(lambda x: int(x.split(" ")[0].split("/")[1]))
seattle["DayOfWeek"] = seattle["Occurred Date or Date Range Start"] \
.map(lambda x: datetime.datetime.strptime(x.split(" ")[0], "%m/%d/%Y").strftime("%A"))
seattle["Time"] = seattle["Occurred Date or Date Range Start"] \
.map(lambda x: datetime.datetime.strptime(x[11:], "%I:%M:%S %p").strftime("%H:%M"))
seattle["TimeBin"] = seattle.Time.map(time_bin)
# Reindex the DataFrame; .copy() so the category renaming below edits a frame, not a view
seattle_normalized = seattle[columns_index].copy()
seattle_normalized.sample(n=5)
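The Seattle timestamp parsing above can be verified on a single toy value in the same `MM/DD/YYYY HH:MM:SS AM/PM` format:

```python
import datetime

# Toy timestamp in the "Occurred Date or Date Range Start" format
ts = "06/15/2014 11:30:00 PM"
day_of_month = int(ts.split(" ")[0].split("/")[1])
day_of_week = datetime.datetime.strptime(ts.split(" ")[0], "%m/%d/%Y").strftime("%A")
time_24h = datetime.datetime.strptime(ts[11:], "%I:%M:%S %p").strftime("%H:%M")
```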
# Normalizing category names (Seattle category names into SF category names)
category_subset = {
"RECKLESS BURNING": "ARSON",
"THEFT-OTH": "LARCENY/THEFT",
"BURGLARY-FORCE-NONRES": "BURGLARY",
"INJURY - ACCIDENTAL": "OTHER OFFENSES",
"ANIMAL-BITE": "OTHER OFFENSES",
"ANIMAL-CRUELTY": "OTHER OFFENSES",
"ANIMAL-OTH": "OTHER OFFENSES",
"ASSLT-AGG-BODYFORCE": "ASSAULT",
"ASSLT-AGG-GUN": "ASSAULT",
"ASSLT-AGG-POLICE-BODYFORCE": "ASSAULT",
"ASSLT-AGG-POLICE-GUN": "ASSAULT",
"ASSLT-AGG-POLICE-WEAPON": "ASSAULT",
"ASSLT-AGG-WEAPON": "ASSAULT",
"ASSLT-NONAGG": "ASSAULT",
"ASSLT-NONAGG-POLICE": "ASSAULT",
"BIAS INCIDENT": "NON-CRIMINAL",
"BURGLARY-FORCE-RES": "BURGLARY",
"BURGLARY-NOFORCE-NONRES": "BURGLARY",
"BURGLARY-NOFORCE-RES": "BURGLARY",
"BURGLARY-SECURE PARKING-NONRES": "BURGLARY",
"BURGLARY-SECURE PARKING-RES": "BURGLARY",
"COUNTERFEIT": "FORGERY/COUNTERFEITING",
"DISPUTE-CIVIL PROPERTY (AUTO)": "NON-CRIMINAL",
"DISPUTE-CIVIL PROPERTY (NON AU": "NON-CRIMINAL",
"DISPUTE-OTH": "NON-CRIMINAL",
"DISTURBANCE-NOISE": "NON-CRIMINAL",
"DISTURBANCE-OTH": "NON-CRIMINAL",
"DRIVE-BY": "OTHER OFFENSES",
"DUI-DRUGS": "DRIVING UNDER THE INFLUENCE",
"DUI-LIQUOR": "DRIVING UNDER THE INFLUENCE",
"ELUDING-FELONY FLIGHT": "RUNAWAY",
"EMBEZZLE": "EMBEZZLEMENT",
"ENDANGERMENT": "OTHER OFFENSES",
"ESCAPE": "RUNAWAY",
"FALSE REPORT": "OTHER OFFENSES",
"FIREWORK-POSSESS": "OTHER OFFENSES",
"FIREWORK-USE": "OTHER OFFENSES",
"FORGERY-CHECK": "FORGERY/COUNTERFEITING",
"FORGERY-CREDIT CARD": "FORGERY/COUNTERFEITING",
"FORGERY-OTH": "FORGERY/COUNTERFEITING",
"FRAUD-CHECK": "FRAUD",
"FRAUD-COMPUTER": "FRAUD",
"FRAUD-CREDIT CARD": "FRAUD",
"FRAUD-IDENTITY THEFT": "FRAUD",
"FRAUD-OTHER": "FRAUD",
"FRAUD-WIRE-ELECTRONIC": "FRAUD",
"HARASSMENT": "DISORDERLY CONDUCT",
"HOMICIDE-JUST-GUN": "ASSAULT",
"HOMICIDE-JUST-WEAPON": "ASSAULT",
"HOMICIDE-PREMEDITATED-GUN": "ASSAULT",
"ILLEGAL DUMPING": "NON-CRIMINAL",
"INJURY - OTHER": "OTHER OFFENSES",
"LIQUOR LAW VIOLATION": "LIQUOR LAWS",
"MALICIOUS HARASSMENT": "DISORDERLY CONDUCT",
"NARC-DRUG TRAFFIC LOITERING": "DRUG/NARCOTIC",
"NARC-EQUIPMENT/PARAPHENALIA": "DRUG/NARCOTIC",
"NARC-FORGERY-PRESCRIPTION": "DRUG/NARCOTIC",
"NARC-FOUND-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-FOUND-COCAINE": "DRUG/NARCOTIC",
"NARC-FOUND-HEROIN": "DRUG/NARCOTIC",
"NARC-FOUND-MARIJU": "DRUG/NARCOTIC",
"NARC-FOUND-METH": "DRUG/NARCOTIC",
"NARC-FOUND-OPIUM": "DRUG/NARCOTIC",
"NARC-FOUND-OTHER": "DRUG/NARCOTIC",
"NARC-FOUND-SYNTHETIC": "DRUG/NARCOTIC",
"NARC-FRAUD-PRESCRIPTION": "DRUG/NARCOTIC",
"NARC-POSSESS-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-POSSESS-COCAINE": "DRUG/NARCOTIC",
"NARC-POSSESS-HALLUCINOGEN": "DRUG/NARCOTIC",
"NARC-POSSESS-HEROIN": "DRUG/NARCOTIC",
"NARC-POSSESS-MARIJU": "DRUG/NARCOTIC",
"NARC-POSSESS-METH": "DRUG/NARCOTIC",
"NARC-POSSESS-OTHER": "DRUG/NARCOTIC",
"NARC-POSSESS-PILL/TABLET": "DRUG/NARCOTIC",
"NARC-PRODUCE-MARIJU": "DRUG/NARCOTIC",
"NARC-SELL-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-SELL-COCAINE": "DRUG/NARCOTIC",
"NARC-SELL-HEROIN": "DRUG/NARCOTIC",
"NARC-SELL-MARIJU": "DRUG/NARCOTIC",
"NARC-SELL-METH": "DRUG/NARCOTIC",
"NARC-SELL-SYNTHETIC": "DRUG/NARCOTIC",
"NARC-SMUGGLE-OTHER": "DRUG/NARCOTIC",
"OBSTRUCT": "OTHER OFFENSES",
"PORNOGRAPHY-OBSCENE MATERIAL": "PORNOGRAPHY/OBSCENE MAT",
"PROP RECOVERED-OTHER AGENCY": "STOLEN PROPERTY",
"PROPERTY DAMAGE - GRAFFITI": "VANDALISM",
"PROPERTY DAMAGE-NON RESIDENTIA": "VANDALISM",
"PROPERTY DAMAGE-RESIDENTIAL": "VANDALISM",
"PROPERTY FOUND": "STOLEN PROPERTY",
"PROPERTY LOST": "STOLEN PROPERTY",
"PROPERTY LOST - POLICE EQUIPME": "STOLEN PROPERTY",
"PROPERTY STOLEN-POSSESS": "STOLEN PROPERTY",
"PROPERTY STOLEN-SELL": "STOLEN PROPERTY",
"PROPERTY STOLEN-TRAFFICKING": "STOLEN PROPERTY",
"PROSTITUTION LOITERING": "PROSTITUTION",
"PROSTITUTION PATRONIZING": "PROSTITUTION",
"PROSTITUTION-ASSIST-PROMOTE": "PROSTITUTION",
"ROBBERY-BANK-BODYFORCE": "ROBBERY",
"ROBBERY-BANK-GUN": "ROBBERY",
"ROBBERY-BANK-WEAPON": "ROBBERY",
"ROBBERY-BUSINESS-BODYFORCE": "ROBBERY",
"ROBBERY-BUSINESS-GUN": "ROBBERY",
"ROBBERY-BUSINESS-WEAPON": "ROBBERY",
"ROBBERY-RESIDENCE-BODYFORCE": "ROBBERY",
"ROBBERY-RESIDENCE-GUN": "ROBBERY",
"ROBBERY-RESIDENCE-WEAPON": "ROBBERY",
"ROBBERY-STREET-BODYFORCE": "ROBBERY",
"ROBBERY-STREET-GUN": "ROBBERY",
"ROBBERY-STREET-WEAPON": "ROBBERY",
"THEFT OF SERVICES": "LARCENY/THEFT",
"THEFT-AUTO PARTS": "LARCENY/THEFT",
"THEFT-AUTOACC": "LARCENY/THEFT",
"THEFT-BICYCLE": "LARCENY/THEFT",
"THEFT-BOAT": "LARCENY/THEFT",
"THEFT-BUILDING": "LARCENY/THEFT",
"THEFT-CARPROWL": "LARCENY/THEFT",
"THEFT-COINOP": "LARCENY/THEFT",
"THEFT-LICENSE PLATE": "LARCENY/THEFT",
"THEFT-MAIL": "LARCENY/THEFT",
"THEFT-PKPOCKET": "LARCENY/THEFT",
"THEFT-PRSNATCH": "LARCENY/THEFT",
"THEFT-SHOPLIFT": "LARCENY/THEFT",
"THREATS-KILL": "ASSAULT",
"THREATS-OTHER": "OTHER OFFENSES",
"THREATS-WEAPON": "ASSAULT",
"TRAFFIC": "OTHER OFFENSES",
"URINATING/DEFECATING-IN PUBLIC": "DISORDERLY CONDUCT",
"VEH-RCVD-FOR OTHER AGENCY": "VEHICLE THEFT",
"VEH-THEFT-AUTO": "VEHICLE THEFT",
"VEH-THEFT-MTRCYCLE": "VEHICLE THEFT",
"VEH-THEFT-OTHVEH": "VEHICLE THEFT",
"VEH-THEFT-TRAILER": "VEHICLE THEFT",
"VEH-THEFT-TRUCK": "VEHICLE THEFT",
"VIOL-COURT ORDER": "OTHER OFFENSES",
"WARRANT-FUGITIVE": "WARRANTS",
"WARRARR-FELONY": "WARRANTS",
"WARRARR-MISDEMEANOR": "WARRANTS",
"WEAPON-CONCEALED": "WEAPON LAWS",
"WEAPON-DISCHARGE": "WEAPON LAWS",
"WEAPON-POSSESSION": "WEAPON LAWS",
"WEAPON-SELLING": "WEAPON LAWS",
"WEAPON-SURRENDER-EXCLUDING FIR": "WEAPON LAWS",
"WEAPON-UNLAWFUL USE": "WEAPON LAWS",
"[INC - CASE DC USE ONLY]": "OTHER OFFENSES"
}
for category, subset in category_subset.items():
    seattle_normalized.loc[seattle_normalized["Category"] == category, "Category"] = subset
sf_categories = set(sf_normalized["Category"].tolist())
seattle_categories = set(seattle_normalized["Category"].tolist())
#print("SF crime categories: %s" % ",".join(sorted(sf_categories)))
#print("Seattle crime categories: %s" % ",".join(sorted(seattle_categories)))
assert seattle_categories <= sf_categories
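The renaming loop above can equivalently be written with `Series.replace`, sketched here on a toy subset of the mapping:

```python
import pandas as pd

# Toy subset of the category_subset mapping
mapping = {"THEFT-OTH": "LARCENY/THEFT", "ASSLT-NONAGG": "ASSAULT"}
s = pd.Series(["THEFT-OTH", "ASSLT-NONAGG", "ARSON"])
# Values present in the mapping are substituted; the rest pass through
s2 = s.replace(mapping)
```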
# Compare DayOfWeek of sf_normalized & seattle_normalized columns
sf_normalized[(sf_normalized['DayOfWeek'] == 'Sunday') & (sf_normalized['Category'] == 'ARSON')]
seattle_normalized[(seattle_normalized['DayOfWeek'] == 'Sunday') & (seattle_normalized['Category'] == 'ARSON')]
sf_normalized[['Description', 'DayOfWeek', 'Time', 'District', ]].head()
seattle_normalized[['Description', 'DayOfWeek', 'Time', 'District', ]].head()
sf_normalized.loc[1000, ['Description', 'Time']]
seattle_normalized.loc[1000, ['Description', 'Time']]
sf_categories
seattle_categories
# Joining process
dataset = pd.concat([sf_normalized, seattle_normalized])
total_nrows = dataset.shape[0]
#print("Total number of rows after union: %d" % total_nrows)
assert total_nrows == sf_nrows + seattle_nrows
dataset.tail()
# Plot the number of reports (y) for each crime category (x)
def plotdat(data, cat):
    l = data.groupby(cat).size().sort_values()
    fig = plt.figure(figsize=(10, 5))
    plt.yticks(fontsize=8)
    l.plot(kind='bar', fontsize=12, color='k')
    plt.xlabel('')
    plt.ylabel('Number of reports', fontsize=10)
plotdat(dataset, 'Category')
There could be a number of reasons for the pattern in the graph above. One possibility: the more trivial the crime, the more incidents are reported. Unless some other cause is at work, this notion could stand. Another possibility is that the riskier the crime, the fewer the reported incidents.
# Below we create a function that tabulates crime types against the districts of a city
def types_districts(dataset, per):
    # Group by crime type and district
    hoods_per_type = dataset.groupby('Description').District.value_counts(sort=True)
    t = hoods_per_type.unstack().fillna(0)
    # Sort districts by their total number of reports
    hood_sum = t.sum(axis=0).sort_values(ascending=False)
    t = t[hood_sum.index]
    # Total reports per crime type
    crime_sum = t.sum(axis=1).sort_values()
    # Large number of types, so slice the data at the given percentile
    p = np.percentile(crime_sum, per)
    ix = crime_sum[crime_sum > p]
    t = t.loc[ix.index]
    return t
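The groupby/value_counts/unstack step at the heart of `types_districts` can be seen on a toy frame (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Description": ["THEFT", "THEFT", "ASSAULT", "THEFT"],
    "District": ["A", "B", "A", "A"],
})
# Description x District count matrix, with missing combinations as 0
t = df.groupby("Description")["District"].value_counts().unstack().fillna(0)
```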
t=types_districts(dataset,96)
# Let's drill down into one category (LARCENY/THEFT, because it has the most reported cases)
cat = sf[sf['Category'] == 'LARCENY/THEFT'].copy()
# Bin crime by a 30-day window (approximate month)
cat['Month'] = np.floor(cat['days'] / 30)
# Default
district = 'All'
def timeseries(dat, per):
    '''Category counts grouped by month'''
    # Group by month and crime description
    cat_per_time = dat.groupby('Month').Description.value_counts(sort=True)
    t = cat_per_time.unstack().fillna(0)
    # Total reports per description
    crime_sum = t.sum(axis=0).sort_values()
    # Large number of descriptions, so slice the data at the given percentile
    p = np.percentile(crime_sum, per)
    ix = crime_sum[crime_sum > p]
    t = t[ix.index]
    return t
t_all = timeseries(cat, 0)
t_all
t_all.describe()
It seems that, as we move further down the LARCENY/THEFT descriptions, the mean and variance increase. This is likely an artifact of the ordering of the chosen descriptions (from least to most reported).
# Let's use real dates for plotting
days_from_start = pd.to_timedelta(t_all.index * 30, unit='D')
dates_for_plot = date.min() + days_from_start
time_labels = dates_for_plot.map(lambda x: str(x.year) + '-' + str(x.month))
time_labels
# First 10 Descriptions from the t_all dataframe, overlaid on one axis
first_ten = ['THEFT OF WRITTEN INSTRUMENT', 'ATTEMPTED THEFT FROM UNLOCKED VEHICLE',
             'ATTEMPTED GRAND THEFT FROM PERSON', 'ATTEMPTED PETTY THEFT OF PROPERTY',
             'ATTEMPTED THEFT OF A BICYCLE', 'ATTEMPTED SHOPLIFTING',
             'THEFT OF ANIMALS (GENERAL)', 'ATTEMPTED THEFT FROM A BUILDING',
             'GRAND THEFT PURSESNATCH', 'ATTEMPTED THEFT OF A BICYCLE']
colors = ['blue', 'brown', 'yellow', 'black', 'green'] * 2
for desc, color in zip(first_ten, colors):
    t_all[desc].plot(figsize=(16, 12), legend=True, color=color)
# Last 10 Descriptions from the t_all dataframe, overlaid on one axis
last_ten = ['GRAND THEFT PICKPOCKET', 'GRAND THEFT FROM A BUILDING',
            'GRAND THEFT FROM PERSON', 'PETTY THEFT SHOPLIFTING',
            'PETTY THEFT FROM A BUILDING', 'GRAND THEFT OF PROPERTY',
            'GRAND THEFT FROM UNLOCKED AUTO', 'PETTY THEFT FROM LOCKED AUTO',
            'PETTY THEFT OF PROPERTY', 'GRAND THEFT FROM LOCKED AUTO']
colors = ['blue', 'brown', 'yellow', 'black', 'green'] * 2
for desc, color in zip(last_ten, colors):
    t_all[desc].plot(figsize=(16, 12), legend=True, color=color)
Taking the first and last ten descriptions of the LARCENY/THEFT category, it seems that as the months pass (June through August) the number of case reports declines. Most descriptions first rise from June to mid-July, and from there drop towards zero. An obvious candidate explanation is the turn of the seasons in the northern hemisphere (summer into autumn), when people tend to stay indoors more. Other explanations are yet to be found!
# Packages needed for below analysis
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colorbar as cbar
# Let's drill down onto one Category
cat=dataset[dataset['Category']=='LARCENY/THEFT']
c=cat['Description'].value_counts()
c.sort(ascending=False)
c.head(10)
# Select data from SF and group by TimeBin (hour) and crime Category
gb = dataset[dataset["City"] == SF].groupby(["TimeBin", "Category"])
# Sorting since it is more pleasing visually
categories = sorted(list(set(dataset.Category.tolist())))
# number of categories
n_categories = len(categories)
# matrix where the x-axis is the hour, and y-axis is the category
img_src = np.zeros((n_categories, 24))
# Add the frequencies of crime per hour to the matrix
for group, values in gb:
    timebin = group[0]
    category = group[1]
    value = values.shape[0]
    img_src[categories.index(category)][timebin] = value
# Normalize frequencies between 0 and 1 based on the max of each row
for row in img_src:
    m = max(row)
    for idx in range(len(row)):
        row[idx] = row[idx] / m
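The row-wise normalization loop above can also be written as one vectorized NumPy expression, sketched on a toy matrix:

```python
import numpy as np

img = np.array([[1.0, 2.0, 4.0],
                [10.0, 5.0, 2.0]])
# Divide each row by its own maximum so every row peaks at 1
img_norm = img / img.max(axis=1, keepdims=True)
```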
# Plotting procedure
def preparePlot(xticks, yticks, figsize=(10.5, 6), hideLabels=False, gridColor='#999999',
                gridWidth=1.0):
    plt.close()
    fig, ax = plt.subplots(figsize=figsize, facecolor='white', edgecolor='white')
    ax.axes.tick_params(labelcolor='#999999', labelsize='10')
    for axis, ticks in [(ax.get_xaxis(), xticks), (ax.get_yaxis(), yticks)]:
        axis.set_ticks_position('none')
        axis.set_ticks(ticks)
        axis.label.set_color('#999999')
        if hideLabels: axis.set_ticklabels([])
    plt.grid(color=gridColor, linewidth=gridWidth, linestyle='-')
    # map() is lazy in Python 3, so hide the spines with an explicit loop
    for position in ['bottom', 'top', 'left', 'right']:
        ax.spines[position].set_visible(False)
    return fig, ax
# draw grid on axes
fig, ax = preparePlot(np.arange(.5, 23, 1), np.arange(.5, n_categories-1, 1), figsize=(16,14), hideLabels=True,
gridColor='#eeeeee', gridWidth=1.1)
# interpolate crime intensity per hour on the grid
image = plt.imshow(img_src, interpolation='nearest', aspect='auto', cmap=cm.Greys)
# x-axis labels
for x, y, s in zip(np.arange(-.125, 24, 1), np.repeat(-.75, 24), [str(x) for x in range(24)]):
    plt.text(x, y, s, color='#999999', size='10')
# y-axis labels
for x, y, s in zip(np.repeat(-.75, n_categories), np.arange(.125, n_categories, 1), categories):
    plt.text(x, y, s, color='#999999', size='10', horizontalalignment="right")
plt.title("Distribution of crimes per category depending on time of day in San Francisco", size=20, y=1.05)
plt.xlabel("Hour", color='#999999', size="20")
plt.ylabel("Crime Category", color='#999999', size="20")
ax.yaxis.set_label_position("right")
# plot the colorbar to show scale
cbar = fig.colorbar(image, ticks=[0, 0.5, 1], shrink=0.25, orientation='vertical')
cbar.ax.set_yticklabels(['Low', 'Medium', 'High'])
plt.show()
There seem to be peak and off-peak times. Most crimes pick up from around 9am and start tapering off after midnight, except for crimes that depend on nightlife (e.g. prostitution, drunkenness, bribery), or burglaries that occur during sleeping hours, roughly 1-3am.
Other crimes seem to be less prevalent (e.g. Stolen Property, Loitering, Gambling, Pornography/Obscene Mat). There could be various reasons for that; under-reporting or low severity could be major factors.
While looking through the results in the graph above, we noticed a trend. Categories mostly associated with white collar crime (e.g. SUSPICIOUS OCC, SECONDARY CODES, EXTORTION, EMBEZZLEMENT, FRAUD, FORGERY/COUNTERFEITING) mostly occurred during waking hours (9am-9pm). This led to the decision to look into that notion in the next part of the analysis.
# Packages needed for below analysis
from matplotlib.colors import Normalize
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
import pandas as pd
from descartes import PolygonPatch
import folium
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from shapely.prepared import prep
from itertools import chain
import os
import fiona
import matplotlib.pyplot as plt
%matplotlib inline
# Shapefile of San Francisco
SF_NEIGHBORHOODS = "SFFind_Neighborhoods"
sf_neighborhoods = os.path.join("%s.shp" % SF_NEIGHBORHOODS)
shp = fiona.open(sf_neighborhoods)
shp
# Extract map boundaries
bds = shp.bounds
shp.close()
extra = 0.01
ll = (bds[0], bds[1])
ur = (bds[2], bds[3])
coords = list(chain(ll, ur))
# width, height of map
w, h = coords[2] - coords[0], coords[3] - coords[1]
# Basemap used to view shapefiles
m = Basemap(
    projection='tmerc',
    lon_0=-122.,
    lat_0=37.7,
    ellps='WGS84',
    llcrnrlon=coords[0] - extra * w,
    llcrnrlat=coords[1] - extra * h,
    urcrnrlon=coords[2] + extra * w,
    urcrnrlat=coords[3] + extra * h,
    lat_ts=0,
    resolution='i',
    suppress_ticks=True)
m.readshapefile(
    os.path.join(SF_NEIGHBORHOODS),
    'SF',
    color='none',
    zorder=2)
print("Read SF neighborhood data into Basemap")
# Polygon shaping process of neighbourhoods
df_map = pd.DataFrame({
    'poly': [Polygon(xy) for xy in m.SF],
    'ward_name': [ward['name'] for ward in m.SF_info]})
# Polygon areas in map units (m^2), converted to km^2
df_map['area_m'] = df_map['poly'].map(lambda x: x.area)
df_map['area_km'] = df_map['area_m'] / 1000000
# Draw neighborhoods with polygons
df_map['patches'] = df_map['poly'].map(lambda x: PolygonPatch(
    x,
    fc='#000000',
    ec='#ffffff', lw=.5, alpha=1,
    zorder=4))
df_map.head()
# We're going to choose categories that represent white collar crimes
def makePoints(dat):
    # Create Point objects in map coordinates from dataframe lon and lat values
    map_points = pd.Series([Point(m(mapped_x, mapped_y))
                            for mapped_x, mapped_y in zip(dat['Longitude'], dat['Latitude'])])
    plt_points = MultiPoint(list(map_points.values))
    hoods_polygon = prep(MultiPolygon(list(df_map['poly'].values)))
    # .geoms iterates the parts; list() so the points can be iterated more than once
    pts = list(filter(hoods_polygon.contains, plt_points.geoms))
    return pts
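The prepared-geometry `contains` filter in `makePoints` boils down to point-in-polygon tests. As a simplified, pure-Python stand-in (a ray-casting sketch, not shapely's actual implementation):

```python
def point_in_polygon(x, y, poly):
    """Ray casting: count how many edges a ray to the right of (x, y) crosses."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
pts = [(0.5, 0.5), (2.0, 0.5)]
kept = [p for p in pts if point_in_polygon(p[0], p[1], square)]
```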
white_collar_cat = [
"SUSPICIOUS OCC",
"FORGERY/COUNTERFEITING",
"SECONDARY CODES",
"EXTORTION",
"EMBEZZLEMENT",
"FRAUD"
]
# Keep SF reports of white collar categories during waking hours (9am-9pm)
sf_wcollar_crimes = sf_normalized[
    (sf_normalized["City"] == SF) &
    (sf_normalized["TimeBin"] >= 9) &
    (sf_normalized["TimeBin"] <= 21) &
    (sf_normalized["Category"].isin(white_collar_cat))]
sf_suspicious_occ = sf_wcollar_crimes[sf_wcollar_crimes["Category"] == white_collar_cat[0]]
sf_forgery = sf_wcollar_crimes[sf_wcollar_crimes["Category"] == white_collar_cat[1]]
sf_secondary_codes = sf_wcollar_crimes[sf_wcollar_crimes["Category"] == white_collar_cat[2]]
sf_extortion = sf_wcollar_crimes[sf_wcollar_crimes["Category"] == white_collar_cat[3]]
sf_embezzlement = sf_wcollar_crimes[sf_wcollar_crimes["Category"] == white_collar_cat[4]]
sf_fraud = sf_wcollar_crimes[sf_wcollar_crimes["Category"] == white_collar_cat[5]]
sf_wcollar_points = makePoints(sf_wcollar_crimes)
sf_sus_occ_points = makePoints(sf_suspicious_occ)
sf_forgery_points = makePoints(sf_forgery)
sf_second_codes_points = makePoints(sf_secondary_codes)
sf_extortion_points = makePoints(sf_extortion)
sf_embezzle_points = makePoints(sf_embezzlement)
sf_fraud_points = makePoints(sf_fraud)
# Plotting procedure
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, facecolor='w', frame_on=False)
# Scatter Suspicious OCC occurrences
dev = m.scatter(
[geom.x for geom in sf_sus_occ_points],
[geom.y for geom in sf_sus_occ_points],
10, marker='o', lw=.25,
facecolor='cyan', edgecolor='cyan',
alpha=0.75, antialiased=True,
label='Suspicious OCC', zorder=3)
# Scatter Forgery occurrences
dev = m.scatter(
[geom.x for geom in sf_forgery_points],
[geom.y for geom in sf_forgery_points],
10, marker='o', lw=5.5,
facecolor='red', edgecolor='red',
alpha=0.75, antialiased=True,
label='Forgery', zorder=3)
# Scatter Secondary Codes occurrences
dev = m.scatter(
[geom.x for geom in sf_second_codes_points],
[geom.y for geom in sf_second_codes_points],
10, marker='o', lw=5.5,
facecolor='green', edgecolor='green',
alpha=0.75, antialiased=True,
label='Secondary Codes', zorder=3)
# Scatter Extortion occurrences
dev = m.scatter(
[geom.x for geom in sf_extortion_points],
[geom.y for geom in sf_extortion_points],
10, marker='x', lw=.75,
facecolor='magenta', edgecolor='magenta',
alpha=0.75, antialiased=True,
label='Extortion', zorder=3)
# Scatter embezzlement occurrences
dev = m.scatter(
[geom.x for geom in sf_embezzle_points],
[geom.y for geom in sf_embezzle_points],
10, marker='o', lw=5.5,
facecolor='yellow', edgecolor='yellow',
alpha=0.75, antialiased=True,
label='Embezzlement', zorder=3)
# Scatter Fraud occurrences
dev = m.scatter(
[geom.x for geom in sf_fraud_points],
[geom.y for geom in sf_fraud_points],
10, marker='o', lw=.25,
facecolor='orange', edgecolor='orange',
alpha=0.75, antialiased=True,
label='Fraud', zorder=3)
ax.add_collection(PatchCollection(df_map['patches'].values, match_original=True))
m.drawmapscale(
coords[0] + 0.015, coords[1] - 0.005,
coords[0], coords[1],
2.,
barstyle='fancy', labelstyle='simple',
fillcolor1='w', fillcolor2='#555555',
fontcolor='#555555',
zorder=5)
plt.tight_layout()
plt.legend(loc="center right")
plt.title("Geographic distribution of white collar crime reports during waking hours in San Francisco, Summer 2014", size=15)
fig.set_size_inches(15,15)
plt.show()
For the map above, crime reports associated with white collar crime were chosen.
The map implies a trend for white collar crimes similar to that of blue collar crimes: the locations of most reports are alike, with the central part of the city at the heart of it. This applies mainly to certain white collar categories, like Secondary Codes, Forgery, Fraud and Embezzlement.
This could mean that the above crimes are associated not only with upper classes, but also with less privileged ones, although crime reports thin out further from the city centre. There is, however, a second, weaker trend further out: clusters occur for Suspicious OCC and Secondary Codes, though not as dense as in the city centre. That makes sense in terms of who those crimes are associated with (middle-to-upper income classes living on the outskirts).
# San Francisco 2015 crime reports data
sf_2015 = pd.read_csv('SFPD_Incidents_-_Current_Year__2015_.csv')
sf2015_nrows = sf_2015.shape[0]
sf_2015.head(1)
# Cluster map of descriptive 2014 data
from folium.plugins import MarkerCluster
#for speed purposes
n = 1000
lons=sf.Longitude[0:n]
lats=sf.Latitude[0:n]
locations = list(zip(lats, lons))
mapa = folium.Map(location=[np.mean(lats), np.mean(lons)], zoom_start=1)
# Slice the popups to match the number of locations
mapa.add_child(MarkerCluster(locations=locations,
                             popups=(sf['District'] + ": " + sf['Category'])[0:n]))
mapa
# Cluster map of descriptive 2015 data
#for speed purposes
n = 1000
lats2=sf_2015.Y[0:n]
lons2=sf_2015.X[0:n]
locations2 = list(zip(lats2, lons2))
mapa2 = folium.Map(location=[np.mean(lats2), np.mean(lons2)], zoom_start=1)
# Slice the popups to match the number of locations
mapa2.add_child(MarkerCluster(locations=locations2,
                              popups=(sf_2015['PdDistrict'] + ": " + sf_2015['Category'])[0:n]))
mapa2
# Choropleth map of descriptive 2014 data
SF_COORDINATES = (37.76, -122.45)
#definition of the boundaries in the map
district_geo = r'sfpddistricts.geojson'
#calculating total number of incidents per district
crimedata2 = pd.DataFrame(sf['District'].value_counts().astype(float))
crimedata2.to_json('crimeagg.json')
crimedata2 = crimedata2.reset_index()
crimedata2.columns = ['District', 'Number']
#creation of the choropleth
map1 = folium.Map(location=SF_COORDINATES, zoom_start=12)
map1.choropleth(geo_path = district_geo,
data_out = 'crimeagg.json',
data = crimedata2,
columns = ['District', 'Number'],
key_on = 'feature.properties.DISTRICT',
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.2,
legend_name = 'Number of incidents per district')
map1
# Choropleth map of descriptive 2015 data
#definition of the boundaries in the map
district_geo = r'sfpddistricts.geojson'
#calculating total number of incidents per district
crimedata3 = pd.DataFrame(sf_2015['PdDistrict'].value_counts().astype(float))
crimedata3.to_json('crimeagg2.json')
crimedata3 = crimedata3.reset_index()
crimedata3.columns = ['District', 'Number']
#creation of the choropleth
map2 = folium.Map(location=SF_COORDINATES, zoom_start=12)
map2.choropleth(geo_path = district_geo,
data_out = 'crimeagg2.json',
data = crimedata3,
columns = ['District', 'Number'],
key_on = 'feature.properties.DISTRICT',
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.2,
legend_name = 'Number of incidents per district')
map2
The first two maps portray heavier clustering in the Southern district in 2014 and the Northern district in 2015. In 2014 the Southern district was the most clustered region, but in 2015 its cluster shrank dramatically while the Northern district's grew. One reason for such a shift could be a change in reporting rates in each region between the two consecutive years. Another could be that heavier reporting in one region brought increased police patrols there, pushing crime into the next likely region with higher intensity; this could explain the higher growth in reports from one year to the next.
This matches a pattern common to many cities: the most clustered regions of crime reports are centered in and around the city centre and extend towards that city's less privileged regions. In this case, from the Central district towards the Southern and Northern districts.
We have come to understand the different types of crimes that occur in San Francisco: those that could correlate with the change of seasons; those that occur during the day, at night, or both; white collar crimes whose clusters shift between regions; and cluster changes between regions from one year to the next.
Next we compare the similarity between the two cities (San Francisco and Seattle), i.e. whether they share the same types of crimes in terms of numbers of reports. We use a lightweight hierarchical clustering approach; before measuring similarity, the data must be normalized so the comparison is not dominated by the two cities' different report volumes.
The process below first filters out data that has no counterpart in both cities (no occurrence in both cities means no basis for comparison). Second, we normalize the data so the cities can be compared fairly. Lastly, we output the results with a visual for a better understanding of the process.
# Filtering and normalizing process
# only keep daylight crimes (6am-7pm)
gb = dataset[(dataset["TimeBin"] >= 6) & (dataset["TimeBin"] <= 19)].groupby("Category")
cat_city = gb["City"].value_counts().unstack().fillna(0)
# filter out crimes which don't have an occurrence in both cities
cat_nonnull = cat_city[(cat_city[SF] > 0) & (cat_city[SEATTLE] > 0)]
print("Crime categories not in both cities:")
print(cat_city[(cat_city[SF] == 0) | (cat_city[SEATTLE] == 0)])
# normalize the distribution of crimes so that each row adds up to 1
cat_norm = cat_nonnull.div(cat_nonnull.sum(axis=1), axis=0)
cat_norm.head()
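The row normalization with `div` can be checked on a toy count table (the numbers here are illustrative, not the real results):

```python
import pandas as pd

counts = pd.DataFrame({"sanfrancisco": [8.0, 56.0], "seattle": [92.0, 44.0]},
                      index=["FRAUD", "ASSAULT"])
# Divide each row by its row sum so the two city shares add up to 1
norm = counts.div(counts.sum(axis=1), axis=0)
```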
# Seaborn is used for visualizing above results
import seaborn as sns
g = sns.clustermap(cat_norm, annot=True, cmap="Blues", fmt="g")
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
g.fig.suptitle("Normalized distribution and clustering of crimes between San Francisco and Seattle", size=15)
plt.show()
Since white collar crime is the focus here, Suspicious OCC, Secondary Codes and Extortion, which do not occur in both cities, were filtered out.
Here are the results for the remaining categories associated with white collar crime:
Forgery/Counterfeiting: SF 0.08 : Seattle 0.92; Embezzlement: SF 0.15 : Seattle 0.85; Fraud: SF 0.14 : Seattle 0.86.
These results suggest that, since the shares are not even close to matching (and the other categories were filtered out), the two cities do not share similarities in terms of white collar crime.
On the other hand, there are similarities between the two cities in crimes more often considered blue collar.
Here are those results:
Assault: SF 0.56 : Seattle 0.44; Larceny/Theft: SF 0.44 : Seattle 0.56; Liquor Laws: SF 0.47 : Seattle 0.53.
This means the two cities do share similar outcomes for some crime incidences, but only for a small portion of the overall list of categories. The rest of the categories look much like the white collar ones (little or no similarity).
To conclude the overall analysis for San Francisco and Seattle, especially San Francisco, a few notions stand out. It is clear that most reported crime incidents occur in daylight. A reason could be that most people, perhaps even most criminals, are asleep at night.
The most populated regions have the most crime reports, and there seems to be an increase in crime reports from 2014 to 2015. This could be for income-gap reasons.
Blue collar crime is the most reported, and is located mostly from the city centre towards overpopulated and underprivileged regions.
White collar crime follows the same trend in terms of cluster density, but also forms less dense clusters on the outskirts of the city centre. Reasons for this particular pattern could be under-reporting; or hard-to-reach data sources could be the major factor in the overall theme at hand.